Artificial intelligence exceeds humans in epidemiological job coding

Background: Work circumstances can have a substantial negative impact on health. To explore this, large occupational cohorts of free-text job descriptions are manually coded and linked to exposure. Although several automatic coding tools have been developed, accurate exposure assessment is only feasible with human intervention.

Methods: We developed OPERAS, a customizable decision support system for epidemiological job coding. Using 812,522 entries, we developed and tested classification models for the Professions et Catégories Socioprofessionnelles (PCS) 2003, Nomenclature d'Activités Française (NAF) 2008, International Standard Classification of Occupations (ISCO)-88, and ISCO-68. Each code comes with an estimated correctness measure to identify instances potentially requiring expert review. Here, OPERAS' decision support increases the efficiency and accuracy of the coding process through code suggestions. Using the Formaldehyde, Silica, ALOHA, and DOM job-exposure matrices, we assessed the classification models' exposure assessment accuracy.

Results: We show that, using expert-coded job descriptions as the gold standard, OPERAS realized a 0.66–0.84, 0.62–0.81, 0.60–0.79, and 0.57–0.78 inter-coder reliability (in Cohen's kappa) on the first, second, third, and fourth coding levels, respectively. These exceed the respective inter-coder reliabilities of expert coders, ranging from 0.59–0.76, 0.56–0.71, 0.46–0.63, and 0.40–0.56 on the same levels, enabling a 75.0–98.4% exposure assessment accuracy and an estimated 19.7–55.7% minimum workload reduction.

Conclusions: OPERAS secures a high degree of accuracy in occupational classification and exposure assessment of free-text job descriptions, substantially reducing workload. As such, OPERAS significantly outperforms both expert coders and other current coding tools. This enables large-scale, efficient, and effective exposure assessment, securing healthy work conditions.

The original article on the Lifework [1] data set did not detail the manual coding procedures. However, the corresponding author of that article has since provided the manual coding procedures, as described in the main article.

Description of XGBoost
XGBoost [2] computes a series of individually weak Classification and Regression Trees (CARTs) and combines them into one, better-performing ensemble. For a given dataset D = {(x_i, y_i)} with n examples and m features, a prediction ŷ_i is assigned to an instance by summing the scores (i.e., the leaf weights) of the leaves of each CART in the ensemble for that instance. This is defined as:

\[
  \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}, \tag{1}
\]

where K is the number of CARTs, and f_k is a CART out of the set of all possible CARTs \(\mathcal{F}\). In the current context of multiclass classification with L outcome categories, XGBoost computes L CARTs with a binary outcome in each iteration. Consequently, the algorithm computes an ensemble of L binary-outcome ensembles, each containing K CARTs, to classify job descriptions x_i into an occupational code y_i.
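To make this multiclass setup concrete, the following minimal Python sketch (assuming the xgboost package; the feature matrix and codes below are randomly generated placeholders, not the OPERAS data) trains L per-class tree ensembles with the multi:softprob objective and assigns the code with the highest summed score:

import numpy as np
import xgboost as xgb

# Hypothetical example: n job descriptions encoded as m numeric features
# (e.g., bag-of-words counts) and L occupational codes as integer labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 50))           # placeholder feature matrix (n x m)
y = rng.integers(0, 5, size=1000)    # placeholder codes, L = 5 classes

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softprob",   # L binary-outcome tree sets per round
    "num_class": 5,                  # L outcome categories
    "eval_metric": "mlogloss",
}

# K boosting rounds: each round adds one CART per class to the ensemble.
booster = xgb.train(params, dtrain, num_boost_round=100)

# Predicted code = argmax over the summed leaf weights of each class ensemble.
probs = booster.predict(xgb.DMatrix(X))
pred_codes = probs.argmax(axis=1)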
To learn the set of CARTs used in the ensemble, the following regularized objective is minimized:

\[
  \mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \tag{2}
\]

where l is the loss function that measures the residual error between the target y_i and the prediction ŷ_i. To reduce the chance of overfitting, the second term Ω(f_k) penalizes the complexity of the model [2]; this penalty is the sum of the complexities of the individual CARTs in the ensemble. To define the complexity of a CART Ω(f), a CART f(x) is first defined as:

\[
  f(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T},\; q: \mathbb{R}^{m} \rightarrow \{1, \dots, T\}. \tag{3}
\]

Here, w is the vector of scores on the leaves of CART structure q, and T is the number of leaves. Using this definition, XGBoost defines the complexity of a CART Ω(f) as:

\[
  \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \tag{4}
\]

where w_j is the score on the j-th leaf of a CART, and γ and λ are optimizable hyperparameters controlling the penalty for the number of leaves and the magnitude of the leaf weights, respectively. Each ensemble is computed in an additive manner, where a new CART is fitted to the residual errors of the previous iteration. Using Eq. (2), the objective function at the t-th step becomes:

\[
  \mathcal{L}^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t). \tag{5}
\]

When a new CART is trained, the hyperparameters subsample, colsample_bytree, and colsample_bylevel specify what proportion of the training data is used. Here, subsample is the fraction of the training samples used in each boosting round; lowering this value can increase computational speed and reduce the chance of overfitting [2]. The hyperparameters colsample_bytree and colsample_bylevel are the subsample ratios of the columns used during training of each CART and each tree level, respectively. This is a method used in Random Forest algorithms and can prevent overfitting [3]. However, because subsampling of columns could result in the loss of important interactions between features of the job descriptions, both were set to include all columns [4].
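As an illustration only, the hyperparameters discussed above map onto XGBoost's native parameter names as in the dictionary below; the values are arbitrary examples rather than the settings tuned for OPERAS, with the column-subsampling ratios kept at 1.0 as described:

params = {
    "objective": "multi:softprob",  # one binary-outcome tree set per class
    "num_class": 5,                 # L outcome categories (placeholder value)
    "gamma": 1.0,                   # per-leaf penalty, the gamma*T term in Eq. (4)
    "lambda": 1.0,                  # L2 penalty on the leaf weights w_j in Eq. (4)
    "subsample": 0.8,               # fraction of training rows per boosting round
    "colsample_bytree": 1.0,        # use all columns per CART, as described above
    "colsample_bylevel": 1.0,       # use all columns per tree level
}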
Depending on the problem the Machine Learning (ML) model is solving (e.g., regression or classification), different loss functions might be required. Because loss functions differ in computational complexity, XGBoost uses the Taylor expansion up to the second order to approximate the value of the loss function around a given point. For example, in the current context this can be used to approximate the local minimum of a function with the logistic loss as its loss function. Applying the Taylor expansion up to the second order to Eq. (5) results in the following objective function:

\[
  \mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t), \tag{6}
\]

where

\[
  g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad
  h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right). \tag{7}
\]

As the instance set of leaf j is defined as I_j = {i | q(x_i) = j}, Eq. (6) is rewritten by dropping the constant terms and expanding Ω as:

\[
  \tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \tfrac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T. \tag{8}
\]

Consequently, the optimal leaf weight w*_j of leaf j for a fixed CART structure q(x) can be computed as:

\[
  w_j^{*} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}. \tag{9}
\]

To guard against possible overfitting, the weights of new leaves are shrunk by hyperparameter λ and restricted to a maximum weight by max_delta_step with each boosting step. Because λ is added to the sum of h_i in the denominator of Eq. (9), increasing its value reduces the chance of weights becoming too large and thus reduces the chance of overfitting. As seen in Eq. (9), the sum of h_i can become very small with imbalanced datasets such as the current ones (see Table 2), resulting in very large weights.
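For intuition, the gradient and Hessian of Eq. (7) for the logistic loss of a single binary ensemble can be written as a custom objective; this is a sketch of the quantities involved, not the library's internal implementation, and dtrain_binary below is a hypothetical DMatrix:

import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    # g_i and h_i of the logistic loss (Eq. 7), evaluated at the
    # margin predictions of the previous boosting round.
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw margin
    grad = p - y                       # g_i: first derivative of the loss
    hess = p * (1.0 - p)               # h_i: second derivative; near zero when p ~ 0 or 1
    return grad, hess

# Usage sketch for one binary sub-problem:
# booster = xgb.train({"max_delta_step": 1}, dtrain_binary,
#                     num_boost_round=50, obj=logistic_obj)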
Restricting the maximum weight will ensure that one leaf does not have too much influence on ŷ, making the model more conservative. To further prevent overfitting, the XGBoost algorithm applies shrinkage of leaf weights, as introduced by Friedman [5]. Shrinkage scales newly added weights w*_j by a factor η to reduce the influence of an individual CART. This leaves space for future CARTs to improve the ensemble.
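Written out, the shrinkage step amounts to the update below, where η corresponds to XGBoost's eta (learning_rate) hyperparameter:

\[
  \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i), \qquad 0 < \eta \le 1 .
\]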
Substituting the optimal leaf weights w*_j back into Eq. (8) gives the corresponding optimal objective value:

\[
  \tilde{\mathcal{L}}^{(t)}(q) = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T. \tag{10}
\]

This equation is also used as a scoring function to measure the quality of a CART structure q(x).
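A minimal sketch of Eqs. (9) and (10) as standalone functions, assuming g and h are arrays holding the gradients and Hessians of the instances in a leaf (the function names are illustrative and not part of the XGBoost API):

import numpy as np

def optimal_leaf_weight(g, h, lam):
    # Eq. (9): w*_j = -sum(g_i) / (sum(h_i) + lambda) for one leaf.
    return -np.sum(g) / (np.sum(h) + lam)

def structure_score(leaf_groups, lam, gamma):
    # Eq. (10): scoring function for a fixed CART structure q(x);
    # leaf_groups is a list of (g, h) pairs, one per leaf. Lower is better.
    score = sum(np.sum(g) ** 2 / (np.sum(h) + lam) for g, h in leaf_groups)
    return -0.5 * score + gamma * len(leaf_groups)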
Because of the number of different possible combinations of splits, enumerating all possible CART structures q(x) is intractable. Hence, a greedy algorithm uses the following formula to iteratively find the split providing the most information gain:

\[
  \mathcal{L}_{\text{split}} = \tfrac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma. \tag{11}
\]
Here, I_L and I_R are the instance sets of the left and right nodes after the split, respectively, and I = I_L ∪ I_R. The algorithm computes the next best splits iteratively until no further gain can be found (i.e., gain < 0) or the CART has reached its maximum depth. Both stopping conditions can be empirically optimized through the hyperparameters γ and max_depth, respectively. Here, γ is the minimum gain each split should produce. In the current domain, too small a value of γ could result in the addition of splits involving irrelevant interactions between features of job descriptions [4], whereas too large a value would result in important interactions being missed. The hyperparameter max_depth refers to the maximum depth of a CART. Lowering this value means the CARTs in the model contain fewer splits, making them less likely to overfit the training data [6].
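The gain computation of Eq. (11) can likewise be sketched as a plain function; a split would only be kept while the returned gain is positive (again an illustration, not the library's internal routine):

import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    # Eq. (11): gain of splitting one node into left/right instance sets I_L and I_R.
    def term(g, h):
        return np.sum(g) ** 2 / (np.sum(h) + lam)
    parent = term(np.concatenate([g_left, g_right]),
                  np.concatenate([h_left, h_right]))
    return 0.5 * (term(g_left, h_left) + term(g_right, h_right) - parent) - gamma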